Conversation

@dangotbanned (Member) commented Oct 18, 2025

Related issues

Notes

  • Take advantage of
    • `pc.unique` preserving order (see the sketch after this list)
    • structs/lists
  • Questions
    • How much can be performed without collection?
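
For reference, a toy example of the ordering behaviour that first note relies on (values are illustrative only):

```python
import pyarrow as pa
import pyarrow.compute as pc

chunked = pa.chunked_array([["b", "a"], ["b", "c", "a"]])
# `pc.unique` returns distinct values in order of first appearance,
# so the result here is ["b", "a", "c"].
print(pc.unique(chunked))
```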

Tasks

@dangotbanned mentioned this pull request Oct 18, 2025
`expected` is now taken from testing the same selector on `main`
@dangotbanned added the `enhancement` and `fix` labels Oct 19, 2025
- Already works, but I wanna add some optimizations for the single partition case
- `pc.unique` can be used directly on a lot of `ChunkedArray` types, but `filter` will drop nulls by default, so nulls need some care if present (see the sketch below)
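
A quick sketch of that null caveat (toy data, not the PR code):

```python
import pyarrow as pa
import pyarrow.compute as pc

keys = pa.chunked_array([["a", None], ["b", "a"]])
tbl = pa.table({"key": keys, "v": [1, 2, 3, 4]})

pc.unique(keys)  # fine to call directly on the ChunkedArray

# The comparison mask is null wherever `key` is null, and the default
# null_selection_behavior="drop" silently drops those rows:
mask = pc.equal(keys, "a")                             # [True, null, False, True]
tbl.filter(mask)                                       # 2 rows; the null-key row is gone
tbl.filter(mask, null_selection_behavior="emit_null")  # 3 rows; null-key row kept as nulls
```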
Avoids the need for a temporary composite key column by using `dictionary_encode` and generating boolean masks based on index position
Left a comment in `selectors` about this issue earlier
Comment on lines +198 to +201
for idx in range(len(arr_dict.dictionary)):
    # NOTE: Acero filter doesn't support `null_selection_behavior="emit_null"`
    # Is there any reasonable way to do this in Acero?
    yield native.filter(pc.equal(pa.scalar(idx), indices))
Member

is this for use in over(partition_by=...)?

if so, just as a heads up, we won't be able to accept a solution which involves looping over partitions in python

@dangotbanned (Member, Author) commented Oct 23, 2025

Oh hi Marco, fancy seeing you here 😄

is this for use in over(partition_by=...)?

No, this part is just for DataFrame.partition_by("one_column").
The multi-column variant of that is run in the C++ engine.

For over(partition_by=...) I need the partitions put back together, and it only deals with a single column rather than a full table as here.
These are different enough that I'd more likely use union (41d8cc2), and keeping things threaded will probably make more sense.
I'm also thinking ahead to over(*partition_by, order_by=...), which will insert one of these nodes into the plan.

Edit: I've just updated the description to try and be clearer that these are related problems - but only in a conceptual sense


But to clarify

we won't be able to accept a solution which involves looping over partitions in python

This isn't looping over partitions, it is looping over an index into the partition key.
The index is being used to create a mask from the indices, which AFAICT isn't expensive.
At the very least, I'm expecting this to be cheaper than what `ArrowGroupBy.__iter__` currently does, which involves casting and concatenating columns before filtering on the resulting keys.
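
Roughly, the whole thing looks like this (just a sketch with illustrative names, not the actual PR code):

```python
import pyarrow as pa
import pyarrow.compute as pc


def partition_by_single(table: pa.Table, key: str):
    """Yield one sub-table per distinct value of `key`, nulls last.

    Sketch only: `partition_by_single` is an illustrative name, not the PR's helper.
    """
    # One dictionary-encode pass: `dictionary` holds the distinct key values
    # and `indices` maps every row to its slot in that dictionary.
    encoded = table.column(key).combine_chunks().dictionary_encode()
    indices = encoded.indices
    for idx in range(len(encoded.dictionary)):
        # Boolean mask over the index positions, not over the partitions themselves.
        yield table.filter(pc.equal(indices, pa.scalar(idx, indices.type)))
    if encoded.null_count:
        # Null keys never match an index, so they come out as their own partition.
        yield table.filter(pc.is_null(indices))
```

The only Python-level loop is over the dictionary slots, which is what I mean by looping over an index into the partition key.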

But the note ...

Is there any reasonable way to do this in Acero?

... is me leaving myself a TODO to try and benchmark that later 😅
